Biostat 203B Homework 1

Due Jan 26, 2024 @ 11:59PM (deadline extended to Feb 5 due to Mac repairs)

Author

Kathy Hoang, UID: 506333118

Display machine information for reproducibility:

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.2    fastmap_1.1.1     cli_3.6.2        
 [5] tools_4.3.2       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.8       
 [9] rmarkdown_2.25    knitr_1.45        jsonlite_1.8.8    xfun_0.41        
[13] digest_0.6.34     rlang_1.1.3       evaluate_0.23    

Q1. Git/GitHub

No handwritten homework reports are accepted for this course. We work with Git and GitHub. Efficient and abundant use of Git, e.g., frequent and well-documented commits, is an important criterion for grading your homework.

  1. Apply for the Student Developer Pack at GitHub using your UCLA email. You’ll get GitHub Pro account for free (unlimited public and private repositories).

  2. Create a private repository biostat-203b-2024-winter and add Hua-Zhou and TA team (Tomoki-Okuno for Lec 1; jonathanhori and jasenzhang1 for Lec 80) as your collaborators with write permission.

  3. Top directories of the repository should be hw1, hw2, … Maintain two branches main and develop. The develop branch will be your main playground, the place where you develop solution (code) to homework problems and write up report. The main branch will be your presentation area. Submit your homework files (Quarto file qmd, html file converted by Quarto, all code and extra data sets to reproduce results) in the main branch.

  4. After each homework due date, course reader and instructor will check out your main branch for grading. Tag each of your homework submissions with tag names hw1, hw2, … Tagging time will be used as your submission time. That means if you tag your hw1 submission after deadline, penalty points will be deducted for late submission.

  5. After this course, you can make this repository public and use it to demonstrate your skill sets on job market.

Answer: The URL of my GitHub repository is https://github.com/kathyhoang25/biostat-203b-2024-winter.git
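The develop/main/tag workflow described above can be sketched in a throwaway repository (a minimal illustration; the repository path, identity, and file names here are placeholders, not my actual setup):

```shell
# Throwaway repository; identity and file names are placeholders
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git checkout -q -b main
git config user.email "student@example.com"
git config user.name "Student"

# First commit on main so the branch exists
echo "# Biostat 203B" > README.md
git add README.md
git commit -q -m "initial commit"

# Develop the solution on develop, then merge back for presentation
git checkout -q -b develop
echo "solution" > hw1.qmd
git add hw1.qmd
git commit -q -m "hw1: draft solution"
git checkout -q main
git merge -q --no-edit develop

# Tag the submission; the tag's creation time acts as the submission time
git tag hw1
git tag --list
```

Pushing the tag (`git push origin hw1`) is what records the submission time on GitHub.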

Q2. Data ethics training

This exercise (and later in this course) uses the MIMIC-IV data v2.2, a freely accessible critical care database developed by the MIT Lab for Computational Physiology. Follow the instructions at https://mimic.mit.edu/docs/gettingstarted/ to (1) complete the CITI Data or Specimens Only Research course and (2) obtain the PhysioNet credential for using the MIMIC-IV data. Display the verification links to your completion report and completion certificate here. You must complete Q2 before working on the remaining questions. (Hint: The CITI training takes a few hours and the PhysioNet credentialing takes a couple days; do not leave it to the last minute.)

Answer: I completed the CITI training.

Here is the link to my completion report. Completion Report

Here is the link to my completion certificate. Completion Certificate

I obtained the PhysioNet credential for using the MIMIC-IV data:

PhysioNetCredential.png

Q3. Linux Shell Commands

  1. Make the MIMIC v2.2 data available at location ~/mimic.

Answer I created a symbolic link in my home directory called mimic pointing to the MIMIC v2.2 data folder.

ls -l ~/mimic/
total 48
-rw-rw-r--@  1 kathyhoang  staff  13332 Jan  5  2023 CHANGELOG.txt
-rw-rw-r--@  1 kathyhoang  staff   2518 Jan  5  2023 LICENSE.txt
-rw-rw-r--@  1 kathyhoang  staff   2884 Jan  6  2023 SHA256SUMS.txt
drwxr-xr-x@ 28 kathyhoang  staff    896 Jan 31 01:35 hosp
drwxr-xr-x@ 11 kathyhoang  staff    352 Jan 29 22:23 icu
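For reference, the link can be created with `ln -s`. The sketch below uses temporary placeholder paths; the real command points `~/mimic` at wherever the downloaded MIMIC v2.2 folder lives:

```shell
# Sketch with placeholder paths: the real command is of the form
#   ln -s /path/to/mimic-iv-2.2 ~/mimic
src=$(mktemp -d)                 # stands in for the downloaded data folder
mkdir "$src/hosp" "$src/icu"
link=$(mktemp -d)/mimic          # stands in for ~/mimic
ln -s "$src" "$link"

ls -ld "$link"                   # first character 'l' marks a symbolic link
readlink "$link"                 # prints the target directory
```

The link costs essentially no disk space, so the data are never copied.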

Refer to the documentation https://physionet.org/content/mimiciv/2.2/ for details of data files. Please, do not put these data files into Git; they are big. Do not copy them into your directory. Do not decompress the gz data files. These create unnecessary big files and are not big-data-friendly practices. Read from the data folder ~/mimic directly in following exercises.

Use Bash commands to answer following questions.

  2. Display the contents in the folders hosp and icu using Bash command ls -l. Why are these data files distributed as .csv.gz files instead of .csv (comma separated values) files? Read the page https://mimic.mit.edu/docs/iv/ to understand what’s in each folder.
ls -l ~/mimic/hosp
total 35858952
-rw-rw-r--@ 1 kathyhoang  staff     73099499 Jan 30 16:33 admissions.csv
-rw-rw-r--@ 1 kathyhoang  staff     15516088 Jan  5  2023 admissions.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff       427468 Jan  5  2023 d_hcpcs.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff       859438 Jan  5  2023 d_icd_diagnoses.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff       578517 Jan  5  2023 d_icd_procedures.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff        12900 Jan  5  2023 d_labitems.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff     25070720 Jan  5  2023 diagnoses_icd.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff      7426955 Jan  5  2023 drgcodes.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff    508524623 Jan  5  2023 emar.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff    471096030 Jan  5  2023 emar_detail.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff      1767138 Jan  5  2023 hcpcsevents.csv.gz
-rw-r--r--@ 1 kathyhoang  staff  13730083993 Feb  1 01:28 labevents.csv
-rw-rw-r--@ 1 kathyhoang  staff   1939088924 Jan  5  2023 labevents.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff     96698496 Jan  5  2023 microbiologyevents.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff     36124944 Jan  5  2023 omr.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff      9881607 Jan 30 17:39 patients.csv
-rw-rw-r--@ 1 kathyhoang  staff      2312631 Jan  5  2023 patients.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff    398753125 Jan  5  2023 pharmacy.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff    498505135 Jan  5  2023 poe.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff     25477219 Jan  5  2023 poe_detail.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff    458817415 Jan  5  2023 prescriptions.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff      6027067 Jan  5  2023 procedures_icd.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff       122507 Jan  5  2023 provider.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff      6781247 Jan  5  2023 services.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff     36158338 Jan  5  2023 transfers.csv.gz
ls -l ~/mimic/icu
total 6155968
-rw-rw-r--@ 1 kathyhoang  staff       35893 Jan  5  2023 caregiver.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff  2467761053 Jan  5  2023 chartevents.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff       57476 Jan  5  2023 d_items.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff    45721062 Jan  5  2023 datetimeevents.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff     2614571 Jan  5  2023 icustays.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff   251962313 Jan  5  2023 ingredientevents.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff   324218488 Jan  5  2023 inputevents.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff    38747895 Jan  5  2023 outputevents.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff    20717852 Jan  5  2023 procedureevents.csv.gz

Answer Gzip is a data compression program and .gz refers to the file format that results from using gzip compression. Thus, the data files are distributed as .csv.gz files instead of .csv files because they are compressed files. This compression technology reduces the size of the files and allows data to be transferred more quickly.
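The effect is easy to see on a toy file: gzip a small synthetic CSV and compare byte counts (toy data for illustration; real compression ratios depend heavily on content, and repetitive text like this compresses especially well):

```shell
# Build a repetitive toy CSV, compress a copy, and compare byte counts
tmp=$(mktemp -d)
for i in $(seq 1 5000); do
  echo "$i,EMERGENCY ROOM,Medicare,WHITE"
done > "$tmp/toy.csv"

gzip -c "$tmp/toy.csv" > "$tmp/toy.csv.gz"

raw=$(wc -c < "$tmp/toy.csv")
gz=$(wc -c < "$tmp/toy.csv.gz")
echo "raw: $raw bytes, gzipped: $gz bytes"
```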

  3. Briefly describe what Bash commands zcat, zless, zmore, and zgrep do.

Answer

#man zcat
#man zless
#man zmore
#man zgrep
#note: commented out to avoid long outputs

zcat: allows you to view the contents of a compressed (gzipped) file without having to decompress it. It is similar to the cat command, but it works on zipped files, hence the term zcat.

zless: is a command that allows you to view the contents of a compressed file one screen at a time without having to decompress it. The biggest difference between zless and zmore is that zless offers more functionality and is a bit faster, since it does not load the entire file into memory before displaying output (less is more). It is similar to the less command, but it works on gzipped files, hence the name zless.

zmore: is a command that allows you to view the contents of a compressed file one screen at a time without having to decompress it. It is similar to the more command, but it works on zipped files, hence the term zmore.

zgrep: is a command that allows you to search compressed files for a regular expression (matching patterns) without having to decompress them. It invokes the grep command, which is short for “global regular expression print”, on compressed/gzipped files.
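The non-interactive commands can be demonstrated on a small gzipped file created on the fly (zless and zmore are pagers, so they are skipped here):

```shell
# Create a small gzipped file to exercise the z* commands on
tmp=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' | gzip > "$tmp/demo.txt.gz"

zcat < "$tmp/demo.txt.gz"            # streams all three lines, no temp file
zgrep -c '^b' "$tmp/demo.txt.gz"     # counts matching lines without decompressing
# zless and zmore are interactive pagers, so they are not run here
```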

  4. (Looping in Bash) What’s the output of the following bash script?
for datafile in ~/mimic/hosp/{a,l,pa}*.gz
do
  ls -l $datafile
done

Answer The bash script loops through all the files in the hosp folder to find files that begin with a, l, or pa and end with .gz. The output is the resulting list of three files in the hosp folder that met the specified criteria: admissions.csv.gz, labevents.csv.gz, and patients.csv.gz.
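The same brace-plus-glob behavior can be reproduced on placeholder files (hypothetical file names chosen to mimic the hosp folder): the braces expand first into three separate globs, each of which is then matched against the directory.

```shell
# Recreate the pattern on placeholder files: {a,l,pa}*.gz expands to
# three globs (a*.gz, l*.gz, pa*.gz), each matched against the directory
tmp=$(mktemp -d)
touch "$tmp"/admissions.csv.gz "$tmp"/labevents.csv.gz \
      "$tmp"/patients.csv.gz "$tmp"/drgcodes.csv.gz

for datafile in "$tmp"/{a,l,pa}*.gz; do
  basename "$datafile"
done
# drgcodes.csv.gz is skipped: it matches none of a*, l*, or pa*
```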

Display the number of lines in each data file using a similar loop. (Hint: combine Linux commands zcat < and wc -l.)

Answer

for datafile in ~/mimic/hosp/{a,l,pa}*.gz
do
  zcat < $datafile | wc -l
done
  431232
 118171368
  299713
  5. Display the first few lines of admissions.csv.gz. How many rows are in this data file? How many unique patients (identified by subject_id) are in this data file? Do they match the number of patients listed in the patients.csv.gz file? (Hint: combine Linux commands zcat <, head/tail, awk, sort, uniq, wc, and so on.)

Answer

#Display the first 5 lines of admissions.csv.gz
for files in ~/mimic/hosp/admissions.csv.gz
do
  zmore < $files | head -n 5
done
subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admit_provider_id,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag
10000032,22595853,2180-05-06 22:23:00,2180-05-07 17:15:00,,URGENT,P874LG,TRANSFER FROM HOSPITAL,HOME,Other,ENGLISH,WIDOWED,WHITE,2180-05-06 19:17:00,2180-05-06 23:30:00,0
10000032,22841357,2180-06-26 18:27:00,2180-06-27 18:49:00,,EW EMER.,P09Q6Y,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-06-26 15:54:00,2180-06-26 21:31:00,0
10000032,25742920,2180-08-05 23:44:00,2180-08-07 17:50:00,,EW EMER.,P60CC5,EMERGENCY ROOM,HOSPICE,Medicaid,ENGLISH,WIDOWED,WHITE,2180-08-05 20:58:00,2180-08-06 01:44:00,0
10000032,29079034,2180-07-23 12:35:00,2180-07-25 17:55:00,,EW EMER.,P30KEH,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-07-23 05:54:00,2180-07-23 14:00:00,0
#Number of rows in admissions.csv.gz
num_rows=$(zless < ~/mimic/hosp/admissions.csv.gz | wc -l)
echo There are $num_rows rows in admissions.csv.gz.
There are 431232 rows in admissions.csv.gz.
#Unique Patients in admissions.csv.gz
num_uniq_pat=$(zless < ~/mimic/hosp/admissions.csv.gz | tail -n +2 | awk -F, '{print $1}' | sort | uniq | wc -l)
echo There are $num_uniq_pat unique patients in admissions.csv.gz.

#Scratchwork
#Check that the column selected is subject_id
#zcat < ~/mimic/hosp/admissions.csv.gz | awk -F, '{print $1}' | head
#subject_id is the first column so print $1
#tail -n +2 skips the first row, which is the header/column names so we don't want to include this in our count
## + means line numbering should start from here
#zless < ~/mimic/hosp/admissions.csv.gz | tail -n +2 | head -n 5
There are 180733 unique patients in admissions.csv.gz.
#Unique Patients in patients.csv.gz
num_pat=$(zcat < ~/mimic/hosp/patients.csv.gz | tail -n +2 | awk -F, '{print $1}' | sort | uniq | wc -l) 
echo There are $num_pat unique patients in patients.csv.gz.

#Scratchwork
#Print first few lines of patients.csv.gz using head
#zcat < ~/mimic/hosp/patients.csv.gz | head
#subject_id is the first column so print $1
There are 299712 unique patients in patients.csv.gz.

The number of unique patients in the admissions.csv.gz file, 180733, does not match the number of patients listed in the patients.csv.gz file, 299712. This is because the patients listed in the admissions.csv.gz file represent only the patients who have been admitted to the hospital, while the patients listed in the patients.csv.gz file represent all of the patients in the database.
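The counting pipeline generalizes to any CSV: on a toy file (made-up rows for illustration) the same tail/awk/sort/uniq chain yields the number of distinct IDs in the first column.

```shell
# Count distinct IDs in column 1 of a toy CSV, skipping the header line
tmp=$(mktemp -d)
cat > "$tmp/toy.csv" <<'EOF'
subject_id,hadm_id
10000032,22595853
10000032,22841357
10000084,23052089
EOF

# tail -n +2 drops the header; sort | uniq collapses repeated IDs
n_uniq=$(tail -n +2 "$tmp/toy.csv" | awk -F, '{print $1}' | sort | uniq | wc -l)
echo "$n_uniq distinct subject_ids"
```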

  6. What are the possible values taken by each of the variable admission_type, admission_location, insurance, and ethnicity? Also report the count for each unique value of these variables. (Hint: combine Linux commands zcat, head/tail, awk, uniq -c, wc, and so on; skip the header line.)

Answer Admission Type is of data type VARCHAR(40) NOT NULL and there are 9 possible values: ‘AMBULATORY OBSERVATION’, ‘DIRECT EMER.’, ‘DIRECT OBSERVATION’, ‘ELECTIVE’, ‘EU OBSERVATION’, ‘EW EMER.’, ‘OBSERVATION ADMIT’, ‘SURGICAL SAME DAY ADMISSION’, and ‘URGENT’. These are the counts for each unique value:

#zless < ~/mimic/hosp/admissions.csv.gz | head -n 5
#column numbers: 6 (admission_type), 8 (admission_location), 10(insurance), 13(race)
#Note: uniq -c counts the number of occurrences of each unique value

#Admission Type
zless < ~/mimic/hosp/admissions.csv.gz | tail -n +2 | awk -F, '{print $6}' | sort | uniq -c
6626 AMBULATORY OBSERVATION
19554 DIRECT EMER.
18707 DIRECT OBSERVATION
10565 ELECTIVE
94776 EU OBSERVATION
149413 EW EMER.
52668 OBSERVATION ADMIT
34231 SURGICAL SAME DAY ADMISSION
44691 URGENT

Admission Location is of data type VARCHAR(60) and there are 11 possible values: ‘AMBULATORY SURGERY TRANSFER’, ‘CLINIC REFERRAL’, ‘EMERGENCY ROOM’, ‘INFORMATION NOT AVAILABLE’, ‘INTERNAL TRANSFER TO OR FROM PSYCH’, ‘PACU’, ‘PHYSICIAN REFERRAL’, ‘PROCEDURE SITE’, ‘TRANSFER FROM HOSPITAL’, ‘TRANSFER FROM SKILLED NURSING FACILITY’, and ‘WALK-IN/SELF REFERRAL’. These are the counts for each unique value:

#Admission Location
zless < ~/mimic/hosp/admissions.csv.gz | tail -n +2 | awk -F, '{print $8}' | sort | uniq -c
 185 AMBULATORY SURGERY TRANSFER
10008 CLINIC REFERRAL
232595 EMERGENCY ROOM
 359 INFORMATION NOT AVAILABLE
4205 INTERNAL TRANSFER TO OR FROM PSYCH
5479 PACU
114963 PHYSICIAN REFERRAL
7804 PROCEDURE SITE
35974 TRANSFER FROM HOSPITAL
3843 TRANSFER FROM SKILLED NURSING FACILITY
15816 WALK-IN/SELF REFERRAL

Insurance is of data type VARCHAR(255) and there are 3 possible values: ‘Medicaid’, ‘Medicare’, and ‘Other’. These are the counts for each unique value:

#Insurance
zless < ~/mimic/hosp/admissions.csv.gz | tail -n +2 | awk -F, '{print $10}' | sort | uniq -c
41330 Medicaid
160560 Medicare
229341 Other

Race is of data type VARCHAR(80) and there are 33 possible values. These are the possible values and the counts for each unique value:

#RACE
zless < ~/mimic/hosp/admissions.csv.gz | tail -n +2 | awk -F, '{print $13}' | sort | uniq -c
 919 AMERICAN INDIAN/ALASKA NATIVE
6156 ASIAN
1198 ASIAN - ASIAN INDIAN
5587 ASIAN - CHINESE
 506 ASIAN - KOREAN
1446 ASIAN - SOUTH EAST ASIAN
2530 BLACK/AFRICAN
59959 BLACK/AFRICAN AMERICAN
4765 BLACK/CAPE VERDEAN
2704 BLACK/CARIBBEAN ISLAND
7754 HISPANIC OR LATINO
 437 HISPANIC/LATINO - CENTRAL AMERICAN
 639 HISPANIC/LATINO - COLUMBIAN
 500 HISPANIC/LATINO - CUBAN
4383 HISPANIC/LATINO - DOMINICAN
1330 HISPANIC/LATINO - GUATEMALAN
 536 HISPANIC/LATINO - HONDURAN
 665 HISPANIC/LATINO - MEXICAN
8076 HISPANIC/LATINO - PUERTO RICAN
 892 HISPANIC/LATINO - SALVADORAN
 560 MULTIPLE RACE/ETHNICITY
 386 NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER
15102 OTHER
1761 PATIENT DECLINED TO ANSWER
1510 PORTUGUESE
 505 SOUTH AMERICAN
1603 UNABLE TO OBTAIN
10668 UNKNOWN
272932 WHITE
1103 WHITE - BRAZILIAN
1170 WHITE - EASTERN EUROPEAN
7925 WHITE - OTHER EUROPEAN
5024 WHITE - RUSSIAN
  7. To compress, or not to compress. That’s the question. Let’s focus on the big data file labevents.csv.gz. Compare compressed gz file size to the uncompressed file size. Compare the run times of zcat < ~/mimic/labevents.csv.gz | wc -l versus wc -l labevents.csv. Discuss the trade off between storage and speed for big data files. (Hint: gzip -dk < FILENAME.gz > ./FILENAME. Remember to delete the large labevents.csv file after the exercise.)

Answer

#Size of gz file
ls -l ~/mimic/hosp/labevents.csv.gz
#1939088924 bytes
ls -lh ~/mimic/hosp/labevents.csv.gz
#1.8 GB

#Size of uncompressed file 
gzip -dk < ~/mimic/hosp/labevents.csv.gz > ~/mimic/hosp/labevents.csv
ls -l ~/mimic/hosp/labevents.csv
#13730083993 bytes
ls -lh ~/mimic/hosp/labevents.csv
#13 GB

#Note: delete the uncompressed file after the exercise: rm ~/mimic/hosp/labevents.csv
-rw-rw-r--@ 1 kathyhoang  staff  1939088924 Jan  5  2023 /Users/kathyhoang/mimic/hosp/labevents.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff   1.8G Jan  5  2023 /Users/kathyhoang/mimic/hosp/labevents.csv.gz
-rw-r--r--@ 1 kathyhoang  staff  13730083993 Feb  1 02:02 /Users/kathyhoang/mimic/hosp/labevents.csv
-rw-r--r--@ 1 kathyhoang  staff    13G Feb  1 02:02 /Users/kathyhoang/mimic/hosp/labevents.csv
#run time for compressed
time zcat < ~/mimic/hosp/labevents.csv.gz | wc -l

#run time for uncompressed

time wc -l ~/mimic/hosp/labevents.csv
 118171368

real    0m14.010s
user    0m21.879s
sys 0m1.248s
 118171368 /Users/kathyhoang/mimic/hosp/labevents.csv

real    0m14.512s
user    0m12.999s
sys 0m1.180s

The uncompressed file is significantly larger than the compressed file. The uncompressed labevents.csv is 13730083993 bytes (about 13 GB), whereas the compressed labevents.csv.gz is only 1939088924 bytes (about 1.8 GB).

The timing output shows that ‘zcat < ~/mimic/hosp/labevents.csv.gz | wc -l’ used noticeably more user (CPU) time than ‘wc -l ~/mimic/hosp/labevents.csv’ because of the extra decompression work, while the real (wall-clock) times were nearly the same, since decompression and counting overlap in the pipeline.

The trade-off between storage and speed for big data files is that compressed files save a great deal of storage space but require extra processing time to decompress, while uncompressed files can be read directly but take up far more disk space. For a file of this size, however, the difference in wall-clock processing time is modest, so compression is clearly worthwhile.
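The comparison can be reproduced in miniature on generated data (a toy file, so the timing magnitudes are not meaningful; the point is that both routes give the same count while `time` reports the decompression overhead):

```shell
# Toy version of the comparison: count lines through the compressed
# stream and from the raw file; the counts agree either way
tmp=$(mktemp -d)
seq 1 100000 > "$tmp/events.csv"
gzip -c "$tmp/events.csv" > "$tmp/events.csv.gz"

time zcat < "$tmp/events.csv.gz" | wc -l   # pays the decompression cost
time wc -l < "$tmp/events.csv"             # reads the bytes directly
```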

Q5. More fun with Linux

Try following commands in Bash and interpret the results: cal, cal 2024, cal 9 1752 (anything unusual?), date, hostname, arch, uname -a, uptime, who am i, who, w, id, last | head, echo {con,pre}{sent,fer}{s,ed}, time sleep 5, history | tail.

Answer Please refer to the comments in the code below for the interpretation of the results. The comments above correspond to the command below it.

#shows just this month's calendar
cal

#shows the calendar for the year 2024
cal 2024

#shows the calendar for specified month and year: September 1752
cal 9 1752
#September 1752 is unusual because dates Sept 3-13 are missing.
#After doing further research, it turns out that the 11 missing days
#are due to the British Empire's delayed adoption of the Gregorian
#calendar: Britain switched in September 1752 and skipped 11 days to
#realign with the Gregorian system that Catholic countries had adopted
#back in 1582. Stories of "calendar riots" over the lost days are
#likely exaggerated.

#Shows the current date and time
date

#Shows the hostname of the machine (my computer) 
#Ex. CLICC-M-4739, my laptop is lent from UCLA's CLICC 
#but normally it would be my name kathyhoang
hostname

#Shows the architecture of the machine Ex.arm64
arch

# Shows detailed system information for the machine, including the kernel 
#version and the timestamp when the kernel was built, machine architecture
#(arm64), and additional details about the kernel such as the 
#root and version information, etc
uname -a

#Shows the current time, how long the system has been running, 
#the number of users currently logged on, and the
#system load averages for the past 1, 5, and 15 minutes
uptime

#Shows the user logged in on the current terminal, the terminal name,
#and the login time
who am i

#Shows the current users logged in and when they logged in
who

#Shows the current users logged in and information about the users' activities
w

#Shows user and group information, including the current user's UID and GID
id

# Last displays previous logins and head limits it to the most recent 10 logins
last | head

#This line creates a combination of words, uses the curly braces and comma to 
#make various permutations. The first set of curly braces contains prefix options, 
#the second set of curly braces contains the middle part of the word, and the 
#third set of curly braces contains the suffix options that will be appended to
#the end. The comma separates the word options within the curly braces.
echo {con,pre}{sent,fer}{s,ed}

#Time measures the amount of time it takes to execute the sleep 5 command in
#seconds. The sleep 5 command is used to pause the script for 5 seconds. 
#Thus, the output is about 5 seconds in real time, the user time is 0.001 seconds, 
#and the system time is 0.002 seconds.
time sleep 5

#Shows the last 10 commands that were executed in the terminal
history | tail
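The brace-expansion line in particular yields 2 x 2 x 2 = 8 words, one for each combination of prefix, middle, and suffix; this can be checked directly:

```shell
# Each brace group contributes one independent choice, so the
# expansion yields 2 * 2 * 2 = 8 words
set -- {con,pre}{sent,fer}{s,ed}
echo "$#"     # 8
echo "$@"
```

Note that bash expands purely textually, so non-words like “confered” and “prefered” appear alongside real English words.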

Q6. Book

  1. Git clone the repository https://github.com/christophergandrud/Rep-Res-Book for the book Reproducible Research with R and RStudio to your local machine.

  2. Open the project by clicking rep-res-3rd-edition.Rproj and compile the book by clicking Build Book in the Build panel of RStudio. (Hint: I was able to build git_book and epub_book but not pdf_book.)

The point of this exercise is (1) to get the book for free and (2) to see an example how a complicated project such as a book can be organized in a reproducible way.

For grading purpose, include a screenshot of Section 4.1.5 of the book here.

Answer I was able to build the book. Here is a screenshot of Section 4.1.5 of the book:

Section4.1.5